import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-illness.csv
# Convert it into a table using pandas
dataset = pandas.read_csv("doggy-illness.csv", delimiter="\t")
# Print the data
print(dataset)
    male  attended_training  age  body_fat_percentage  core_temperature  \
0      0                  1  6.9                   38         38.423169
1      0                  1  5.4                   32         39.015998
2      1                  1  5.4                   12         39.148341
3      1                  0  4.8                   23         39.060049
4      1                  0  4.8                   15         38.655439
..   ...                ...  ...                  ...               ...
93     0                  0  4.5                   38         37.939942
94     1                  0  1.8                   11         38.790426
95     0                  0  6.6                   20         39.489962
96     0                  0  6.9                   32         38.575742
97     1                  1  6.0                   21         39.766447

    ate_at_tonys_steakhouse  needed_intensive_care  \
0                         0                      0
1                         0                      0
2                         0                      0
3                         0                      0
4                         0                      0
..                      ...                    ...
93                        0                      0
94                        1                      1
95                        0                      0
96                        1                      1
97                        1                      1

    protein_content_of_last_meal
0                           7.66
1                          13.36
2                          12.90
3                          13.45
4                          10.53
..                           ...
93                          7.35
94                         12.18
95                         15.84
96                          9.79
97                         21.30

[98 rows x 8 columns]
We have a variety of information, including what the dogs did the night before, their age, whether they're overweight, and their clinical signs.
In this exercise, our y values, or labels, are represented by the core_temperature column, while our feature will be the age in years.
Let's have a look at how the features and labels are distributed.
import graphing
graphing.histogram(dataset, label_x='age', nbins=10, title="Feature", show=True)
graphing.histogram(dataset, label_x='core_temperature', nbins=10, title="Label")
Looking at our feature (age), we can see that all dogs were 9 years old or younger, and that ages are fairly evenly distributed. In other words, no particular age is substantially more common than any other.
Looking at our label (core_temperature), most dogs seem to have a slightly elevated core temperature (we would normally expect ~37.5 degrees Celsius), which indicates they're unwell. A small number of dogs have a temperature above 40 degrees, which indicates they're quite unwell.
Simply because the shapes of these two distributions are different, we can guess that the feature won't predict the label extremely well. For example, if old age perfectly predicted who would have a high temperature, then the number of old dogs would exactly match the number of dogs with a high temperature.
The model might still end up being useful, though, so let's continue.
The next step is to eyeball the relationship. Let's plot the relationship between the labels and features.
graphing.scatter_2D(dataset, label_x="age", label_y="core_temperature", title='core temperature as a function of age')
It does seem that older dogs tended to have higher temperatures than younger dogs. The relationship is quite "noisy," though; many dogs of the same age have quite different temperatures.
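One way to put a number on how noisy this relationship is would be the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 indicating a weak linear relationship. The sketch below is illustrative only: the `pearson_r` helper is our own (it isn't part of `graphing` or `statsmodels`), and it uses just the ten rows visible in the printed table above, so its result won't match what you'd get from all 98 dogs.

```python
import math

# (age, core_temperature) for the ten rows shown in the printed dataset
# (indices 0-4 and 93-97). This is a small sample, not the full 98 rows.
sample = [
    (6.9, 38.423169),
    (5.4, 39.015998),
    (5.4, 39.148341),
    (4.8, 39.060049),
    (4.8, 38.655439),
    (4.5, 37.939942),
    (1.8, 38.790426),
    (6.6, 39.489962),
    (6.9, 38.575742),
    (6.0, 39.766447),
]


def pearson_r(pairs):
    """Return the Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


print("Correlation between age and core temperature:", pearson_r(sample))
```

With the full dataset loaded, `dataset["age"].corr(dataset["core_temperature"])` computes the same statistic directly in pandas.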
Let's formally examine the relationship between our labels and features by fitting a line (simple linear-regression model) to the dataset.
import statsmodels.formula.api as smf
import graphing # custom graphing code. See our GitHub repo for details
# First, we define our formula using a special syntax
# This says that core temperature is explained by age
formula = "core_temperature ~ age"
# Perform linear regression. This method takes care of
# the entire fitting procedure for us.
model = smf.ols(formula = formula, data = dataset).fit()
# Show a graph of the result
graphing.scatter_2D(dataset, label_x="age",
                    label_y="core_temperature",
                    # Look up the parameters by name: "Intercept" and "age" (the slope)
                    trendline=lambda x: model.params["age"] * x + model.params["Intercept"]
                    )
The line follows the overall upward trend in the data, supporting our hypothesis that there's a positive correlation between a dog's age and its core temperature.
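For intuition about what the fitting procedure works out, ordinary least squares with a single feature has a closed-form solution: the slope is the covariance of feature and label divided by the variance of the feature, and the intercept makes the line pass through the point of means. The sketch below is a plain-Python illustration, not the code `statsmodels` runs internally; `fit_line` and `r_squared` are hypothetical helper names, and the toy data is invented for the example.

```python
def fit_line(xs, ys):
    """Return the least-squares (slope, intercept) for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the label's variance explained by the line (1.0 is perfect)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data (made up for illustration): points scattered around y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
print("Slope:", slope, "Intercept:", intercept)
print("R-squared:", r_squared(xs, ys, slope, intercept))
```

On the fitted model above, the same goodness-of-fit measure is available as `model.rsquared`, which is a useful check on how much of the variation in temperature is actually explained by age.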
Visually, simple linear regression is easy to understand. Let's recap what the parameters mean, though.
print("Intercept:", model.params["Intercept"], "Slope:", model.params["age"])
Intercept: 38.087867548892106 Slope: 0.15333957754731825
Remember that a simple linear-regression model is described by two parameters: the line's intercept and its slope.
Here, our intercept is about 38.1 degrees Celsius. This means that when age is 0, the model will predict 38.1 degrees.
Our slope is about 0.15 degrees Celsius per year, meaning that for every additional year of age, the model will predict a temperature 0.15 degrees higher.
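To make the slope concrete, the sketch below hard-codes the intercept and slope printed above and checks that predictions for dogs one year apart differ by exactly the slope. The `predict` helper and the chosen ages are illustrative assumptions; in the notebook you would read the parameters from `model.params` instead.

```python
# Parameters copied from the model output printed earlier
INTERCEPT = 38.087867548892106
SLOPE = 0.15333957754731825


def predict(age):
    """Predicted core temperature (degrees Celsius) for a dog of the given age."""
    return INTERCEPT + SLOPE * age


for age in (0, 5, 9):
    print(f"age {age}: {predict(age):.2f}")
# age 0: 38.09
# age 5: 38.85
# age 9: 39.47

# Moving one year along the line always changes the prediction by the slope
difference = predict(6) - predict(5)
print(f"difference per year: {difference:.2f}")
# difference per year: 0.15
```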
In the following box, try to change the age to a few different values to see different predictions, and compare these with the line in the preceding graph.
def estimate_temperature(age):
    # model.params["Intercept"] is the intercept and model.params["age"] is the slope
    return age * model.params["age"] + model.params["Intercept"]
print("Estimate temperature from age")
print(estimate_temperature(age=0))
Estimate temperature from age
38.087867548892106
We covered the following concepts in this exercise: